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Summary. The choice of the summary statistics in Bayesian inference and in 
particular in ABC algorithms is paramount to produce a valid outcome. We de- 
rive necessary and sufficient conditions on those statistics for the corresponding 
Bayes factor to be convergent, namely to asymptotically select the true model. 
Those conditions, which amount to the expectations of the summary statistics to 
asymptotically differ under both models, are then usable in ABC settings to deter- 
mine which summary statistics are appropriate, via a standard and quicl< Monte 
Carlo validation. 



1. Introduction 



1.1. Summary statistics 



In Robert et al. (2011 ), the authors showed that the now popular ABC (approxi- 



mate Bayesian computation) method (Tavare et al. 1997 Pritchard et al. 1999 



Toni et al. 2009 Marin et al. 2011) is not necessarily validated when applied 



to Bayesian model choice problems, in the sense that the resulting Bayes factors 
may fail to pick the correct model even asymptotically. The ABC algorithm is 
getting more and more accepted as a component of the Bayesian toolbox for han- 
dling intractable likelihoods. Since ABC is not the central topic of this article, 
but rather both a motivation and an immediate application domain, we do not 
embark upon a complete description of its implementation, referring to |Marin| 
et al. (2011 1 and Fearnhead and Prangle (2012) for details. We simply recall here 
that the core feature of this approximation technique is to run simulations {9, z) 
from the prior distribution and the corresponding sampling distribution until a 
statistic T{z) of the simulated pseudo-data z is close enough to the correspond- 
ing value of the statistic T(y) at the observed data y. The degree of proximity 
(also called the tolerance) can be improved by an increase in the computational 
power. However the choice of the statistic T is particularly crucial in that the 



resulting (approximately Bayesian) inference relies on this statistic and only on 
this statistic. It thus impacts the resulting inference much more than the choices 
of the tolerance distance and of the tolerance value. 



When conducting ABC model choice (Grelaud et al. 2009 Robert et al. 



2011), the outcome of the ideal algorithm associated with zero tolerance is the 



Bayes factor 



/^i(0i)fff(T(y)|0i)d0i 
/ ^2(^2)52^ (T(y)|02)d02 



which unsurprisingly is the Bayes factor for testing COTi versus 9JT2 based on the 
sole observation of T(y). This value most often differs from the Bayes factor 
-Si2(y) based on the whole data y. As discussed in Didelot et al. (2011) and 



Robert et al. (2011 ), in the specific case when the statistic T'(y) is sufficient for 
both 97li and 9)12, the difference between both Bayes factors can be expressed 
as 



5i2(y) = 



My) 
h2{y) 



(y): 



(1) 



where the ratio of the hi (y) 's often behave like likelihoods of same order as the 
data size n. The discrepancy revealed by the above is such that ABC model 
choice cannot be trusted without further checks. Indeed, even in the limiting 
ideal case, i.e. when the ABC algorithm uses an infinite computing power to 
achieve a zero tolerance, the ABC odds ratio does not take into account the 



features of the data besides the value of T(y). Robert et al. ( 2011 ) warn that this 
difference can be such that i3^(y) leads to an inconsistent model choice. (The 
same is obviously true for point estimation, e.g. when considering the special case 
of an ancillary summary statistic T(y).) This is also the reason why Ratmann] 



et al. (2009, 2010) consider the alternative approach of assessing each model 



under several divergence measures, defining a new algorithm they denote ABC^. 

Beyond ABC applications, note that many fields report summary statistics 
in their publications rather than the raw data, for various reasons ranging from 
confidentiality to storage, to proprietary issues. For instance, a dataset may be 
replaced by several p- values, Pi(y), against several specific hypotheses. Handling 
a model choice problem based solely on T(y) — (pi(y), ■ ■ ■ ,Pk{y)) is therefore a 
relevant issue, with the coherence of the corresponding Bayes factor at stake. 
Another relevant instance outside the ABC domain is provided in [Dickey and| 



Gunel (1978), who exhibit the above differences in the Bayes factors when using 



a non-sufficient statistic, including an example where the limiting Bayes factor, 
as the sample size grows to infinity, is or 1. Similarly, Walsh and Raftery| 



(2005) compare point processes via Bayes factors based on summary statistics. 



They discuss those summary statistics (second order statistics and based on 
Voronoi tesselations) based on the misclassification rates of the corresponding 
Bayes factors through a simulation study. However, the connection with the 
genuine Bayes factor is not pursued. (A connection with the ABC setting appears 
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in the conclusion of the paper, though, with a reference to |Diggle and Gratton 



(1984) which is often credited as an originator of the method.) 

The purpose of the current paper is to study asymptotic conditions on the 
statistic T under which the Bayes factor for testing Sti versus 9JI2 based on the 
sole observation of T(y) either converges to the correct answer or not. We obtain 
a precise characterisation of consistency in terms of the limiting distributions of 
the summary statistic T(y) under both models, namely that the true asymptotic 
mean of the summary statistic T'(y) cannot be recovered under the wrong model, 
except for nested models. As explained in the paper, this characterisation implies 
that using point estimation statistics as summary statistics is rarely pertinent 
for testing. The main result shows that a practical choice of summary statistics 
providing convergent model choice is available for ABC algorithms. The practical 
side is computational in that the mean values of the summary statistics can be 
checked by simulation. Further properties of the vector of summary statistics can 
also be tested via these simulations, including the comparison of several summary 
statistics or, equivalently, the selection of the most discriminant components of 
the above vector. 



1.2. Insufficient statistics 

The above connection between the Bayes factor based on the whole data y and 
the Bayes factor based on the summary T(y) is only valid when the latter is 
sufficient for both models. In this setting, and only in this setting, the ratio of 
the hi 's in ([T]) is equal to one solely when the statistic T is furthermore sufficient 
across models 9Jli and 93t2, i-e. for the collection {m,9m) of the model index 
and of the parameter. A rather special instance where this occurs is the case of 



Gibbs random fields (Grelaud et al. 2009). Otherwise, the conclusion drawn on 



r(y) necessarily differs from the conclusion drawn on y. The same is obviously 
true outside the sufficient case, which implies that the selection of a summary 
statistic must be evaluated against its performances for model choice, because 
it is not guaranteed per se. The following example illustrates this point: 



Example 1 . To illustrate the impact of the choice of a summary statistic on 
the Bayes factor, we consider the comparison of model 9Jli: y ^ J^{di, 1) with 
model 97I2: y ^ £(^2, l/v^), the Laplace or double exponential distribution 
with mean 62 and scale parameter 1/^/2, which has a variance equal to one. 
(Since it is irrelevant for consistency issues, we assume throughout the paper 
that the prior probabilities of both models SUti and 9JT2 are equal to 1/2.) 

In this formal setting, four natural statistics can be considered (as suggested 



by one referee of Robert et al. 2011 1 : 



(a) the sample mean y; 

(b) the sample median med(y); 

(c) the sample variance var(y); 
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(d) the median absolute deviation mad(y) = mcd(|y — nied(y)|); 



Given the models under comparison, the first statistic is sufficient only for the 
Gaussian model, the second statistic is not sufficient but its distribution depends 
on 6i in both models, while both the sample variance and the median absolute 



deviation are ancillary statistics. As explained later (Section 2.2), the most 
important feature of those statistics is that the first three statistics have the 
same expectation under both models (using appropriate values of the 9iS under 
both models) while the median absolute deviation has a different expectation 
under model 1 and model 2. 

Since we are facing standard models in this artificial example, the analytic 
computation of the true Bayes factor would be possible, even in the Laplace case. 
However, if we base our inference only on one or several of the above statistics, 
the computation of the corresponding Bayes factors requires an ABC step. Fig. 
[l] shows the distribution of the posterior probability that the model is normal 
(as opposed to Laplace) when the data is either normal or Laplace and when 
the summary statistic in the ABC algorithm is the collection of the first three 
statistics above. The outcome is thus that the estimated posterior probability 
has roughly the same predictive distribution under both models, hence ABC 
based on those summary statistics is not discriminative. Fig. [2] represents the 
same outcome when the summary statistic used in the ABC algorithm is only 
made of the median absolute deviation of the sample. In this second case, the 
two distributions of the estimated posterior probability are quite opposed under 
each model, concentrating near zero and one as the number of observations n 
increases, respectively. Hence, this summary statistic is highly discriminant for 
the comparison of the two models. From an ABC perspective, this means that 
using the median absolute deviation is then satisfactory, as opposed to the first 
three statistics. < 

The above example illustrates very clearly the major result of this paper, 
namely that the mean behaviour of the summary statistic T(y) under both 
models under comparison is fundamental for the convergence of the Bayes factor, 
i.e. of the Bayesian model choice based on T(y). This result, described in the 
next section, thus brings an almost definitive answer to the question raised in 



Robert et al. (2011 1 about the validation of ABC model choice, although it may 
require additional simulation experiments in realistic situations. 

The paper is organised as follows: Section [2] contains the theoretical deriva- 
tion of the asymptotic behaviour of the Bayes factor based on a summary 
statistic. Section [2.1| covering our main assumptions. Section [2. 1| exhibiting the 



asymptotic behaviour of the marginal likelihoods. Section 2.2 detailing the con- 
sequences of this result for model choice based on summary statistics. Section |3] 
illustrates the relevance of our criterion for evaluating summary statistics. Sec- 
tion [4] details the practical implementation of a validation mechanism based on 
the above results, with a population genetic example. Section [5] concludes the 
paper with a short discussion. 
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Fig. 1. Comparison of the distributions of the posterior probabilities that the data is from 
a normal model (as opposed to a Laplace model) with unknown mean d when the data is 
made of n = 10, 100, 1000 observations (left, centre, right, resp.) either from a Gaussian 
or Laplace distribution with mean equal to zero and when the summary statistic in the 
ABC algorithm is the vector made of the collection of the sample mean, median, and 
variance. The ABC algorithm uses a reference table of 10" simulations (5, 000 for each 
model) from the prior 9 ~ Af{0, 4) and selects the tolerance e as the 1% distance quantile 
over those simulations. 

2. Convergence of Bayes factors using summary statistics 

Let y ~ [yi, . . . ,yn) be the observed sample, not necessarily iid. We de- 
note by y ~ P" the true distribution of the sample, and by T(y) = T" = 
{Ti{y) ,T2{'y) T ■ ■ ■ ,Td{y)) a c?-dimensional vector of summary statistics, T" ~ 
G„. The distribution G„ is the projection of P" under the map T" : M" ^ 
and we denote its density by 

There are two competing models OJti and 9712 that we wish to compare: 

- under Mi, y - Fi,n{-\9i) where Oi e Oi C 

- under 0^2, y i^2,n(-|6'2) where 6*2 G 62 C W'' 

The distributions of T" under dJli and DJI2 are denoted by Gi.„(-|6'i) and 
G2,n('|^2), respectively. We also assume that the distribution functions Fi n{-\Oi), 
Gi,n{'\di) have densities fi{-\9i) and gi{-\9i) with respect to some dominating 
measures fii^x and /ij^T ~ 1,2), respectively. Under the respective prior 
distributions tti and tt2 on 9i and 62, the posterior distributions given T" are 
denoted by 7ri(-|T") and 7r2(-|T"). 

2. 1 . Assumptions and asymptotic beliaviour of tfie marginal likelihoods 

Before stating the main result in the paper, we detail theoretical assumptions on 
both the models and the summary statistics under which the main result holds. 



n=10 



n=100 



n=1000 



Gauss Laplac9 



Fig. 2. Same legend and same calibration as in Fig. [ijwhen the ABC algorithm is based 
on the median absolute deviation of the sample as its sole summary statistic. 



We start with a brief primer on our notations. The letter C denotes a generic 
positive constant (with respect to n), whose value may change from one occur- 
rence to the next, but is of no consequence. We write a A 6 to denote min(a, h). 
For two sequences {a„}, of real numbers, a„ < 6„ (resp. >) means a„ < C&„ 
(resp. a„ > C6„). Similarly, a„ ~ &„ means that 

1/C < liminf |a„/6„| < limsup |a„/6„| < C . 



The symbol denotes convergence in distribution. 

Technical assumptions that are necessary for establishing the main result of 
the paper are as follows: 

([Al]) There exist a sequence of positive real numbers converging to +00, a 
distribution Q on M'', and a vector ^W^, such that 

t;„(T" - ^0) Q, under Gn ■ 

([A2]) For i e {1,2}, there exist sieves J^ns C Qi and constants ei,Ti,ai > 0, 
such that 



and, for all 6i E Tn,i, there exists ^i{9i) G such that for all r > 0, 



sup 



< 



G,^r.\\r'^^i^{e,)\>T\^i,{6,)-^i^\^e 

We define the sets Sn/i C Fn.i {i G {1, 2}) as 
We say that 2Hi is compatible with T" if 

inf{|/i,(0,)-Aio|;e«eej = O, 



(2) 



(3) 
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meaning that the asymptotic mean of T" is found within the range of the means 
of T" in model M,. 



([A3]) If SDTi is compatible with T", then there exists a constant d,; < A (ai — 1) 
such that 



(4) 



where Sn,i{u), Ti and ai are defined in assumption [A2] 



([A4]) If 9Jli is compatible with T", then for any e > there exist U,5 > Q and a 
set En such that for all 9i £ Sn,i{U) 



En C {t; g,{t\e,) < Sgn{t)} and G„ < e . 



(5) 



Even though these assumptions might look overwhelming, we claim that 
Al ]- A4 are both mild and relatively easy to check in applications. Here, 



we discuss briefly what such assumptions mean, a more detailed discussion is 



provided in Section 2.3 Furthermore, we will later (Section 3.1) illustrate why 
they hold in the Gaussian versus Laplace example. 

Requiring that T" concentrates at a certain speed w„ as n increases is only 
natural since the summary statistics are here to replace the whole data set and 
thus convey enough a fraction of the data information. (It however also applies 
to some ancillary statistics.) Assumption | [A2]] essentially states that we control 
deviations from the mean for the statistic T" when the mean ^ii{9i) under 971, 
is close to the 'true' asymptotic mean /^o- Thus, when the model 971^ is not 
compatible with T", i.e. when vai{\ni{0i) — /iol;^^ e <di] > 0, this condition 
merely states that under model 971^, the probability that T" deviates from its 
mean is bounded by w,7"'. Note that if inf{|/ii(0i) — ^Q\;9i G 6^} > 0, then we 
can replace condition | [ A2] | by 



\A2'] 



G,, 



< Cv- 



for some r, C > 0. Ways of checking such conditions are given in Section |2.3[ 
Note that Assumptions |[A3]] and [ "A4]| arc only required if model DJli is com- 
patible with T", hence in a case where T" comes from model 9Jlo which is 
incompatible with say 97li, then to control the marginal distribution of T" un- 
der 9Jti the only conditions that are required arc [ Al]] and [ A2]| Assumption 
[A3] describes behaviour of the prior distribution of the mean of T" near the 
true asymptotic value /xq. Assumption |[A4]| states that for values of 9i such 
that ^i{9i) is close to /zq then the density gi{t\9i) is bounded from above by 
0{gn(t)) on a set containing most of the mass under G„. Although this is a 
rather technical assumption, it is not so much demanding and it holds for some 
discrete as well as continuous summary statistics. More importantly, we stress 
that assumption |[A4]| almost holds de facto if model 9Jli is true since f/„ is then 
equal to gi{-\9o) for a specific 9q. More details are provided in Section 2.3 
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The following result provides a fundamental control on the convergence rate 
of the marginal likelihoods. In Lemmajl] mi(-) and m2(-) denote the marginal 
densities of T" under models and 9Jt2, respectively, namely (i = 1,2) 

m,{t)= [ g,{t\9,)7:,i9,)d9,. (6) 



Lemma 1. Under assumptions [Al] - [A4] for « = 1, 2, there exist constants 
Ci, Cu — Orn{l) such that, ifdJli is compatible with T", and > di, ai > di + 2, 



Civ-J^ < < C^v-'^ (7) 

9n(T ) 

and, ifWli is not compatible with T", 

Civ-''^ < -%J- = op. [v-^' + v-"^]. (8) 

The above lemma, or more precisely ([T]), provides an equivalence result for 
the marginal densities mi(T^) when /ig G {t^i{di),di € 6i} but it does not 
specifically require that G„ belongs to model Wli. Appendix 2 details the proof 
of Lemma [l] The following result is a corollary on the use of T" for estimation 
purposes other than model choice: 

Corollary 1. Under the assumptions of Lemma ^ if dJli is compatible 
with T", the posterior distribution 7ri(.|T") concentrates at the rate on 
{Of, ij,i{0i) — fio}, provided > di and a-i > di + 2. Hence, under the posterior 
distribution 7ri(.|r"), fJ.i{9i) converges to /iq o,t the rate l/w„. 



Proof. Equation ([7| of Lemma [T] yields that 

5„(T") - " 



with large probability. For all sequences {wn}n converging to +oo, calculations 
performed in the proof of Lemma [l] (see Appendix 2) yield that with probability 
going to 1 under G„, 

Js„,,{w„y 9n[-l- ) 

Therefore the posterior distribution of fii{9i) has its tail probability given by 
and the corollary follows. □ 
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Lemma [T] helps in understanding the meaning of the parameter di in assump- 
tion [[A3j| when DJli is compatible with T". Indeed, we then have 



m,(r") 



.-di 



thus log(mi(T")/g„(T")) ~ —di logi;„ and v~'^^ appears as a penalization factor 
resulting from integrating 6i out in the very same spirit as the effective number 
of parameters appears in the DIG ( Spiegelhalter et aTj [2002 ) criterion and in the 
discussions in 'Rousseau (2007) and 'Rousseau and Mengersen (2011). In regular 
models, di corresponds to the dimension of leading to the usual BIG 

approximation; however, in non-regular models, which may occur with the kind 
of applications where ABG methods are required, di can be different. This is 
illustrated in the examples in Section [3j We now present the major implication 
of these results on the relevance of some summary statistics to compute Bayes 
factors. 



2.2. Bayes factor consistency 

Lemma [T] implies that the asymptotic behaviour of the Bayes factor is driven by 
the asymptotic mean value of T" under both models. It is usual to assume that 
one of the competing models is true, when studying the behaviour of testing pro- 
cedures (here posterior probabilities and Bayes factors). Here, in full generality, 
it is actually enough that one of the models is compatible with the statistic T". 
Hence, without loss of generality we assume that the true distribution belongs 
to model 9}ti and we first consider the case where model 2Jt2 is also compatible 
with T", i.e. 

M{\no-l^2{92)\;02 e 62} -0. 



Under assumptions [Al] [A4][ 



C,v^i'li-d2) < -(d.-d^) 

where Ci,Cu — Opn(l), irrespective of the true model. Thus the asymptotic 
behaviour of the Bayes factor depends solely on the difference di — d2. For 
instance, if di > ^2 (as in the embedded case) and Gn is in , the Bayes factor 
goes to 0, instead of infinity. If instead di = ^2, the Bayes factor is bounded 
from below and from above and is thus useless to separate the two models. Note 
that the asymptotic (non-convergent) behaviour remains the same even when 
Gn is in neither model, provided 

inf{|Mo - f^2i02)\; 02^02} = inf{|/^o - /ii(0i)|; G 61} = . 

On the contrary, assume that the true distribution is in model COti and that 
model 2^2 is not compatible with T", then the Bayes factor, under assumptions 
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[Al] [A4] satisfies 



mi(r") 
and if niin(Q;2, T2) > di, 



> Ct min ( ^-(di-rs) 

m2[T ) V 



lim — ^ = +00, 



which will lead to choosing the correct model asymptotically. The above then 
implies the following consistency result, which is the core derivation of our paper, 
providing a characterisation of relevant summary statistics: 



Theorem 1. Under assumptions \[Al]\ - \ [A4] if 



inf{|Mo -M2(e2)|;e2 e 62} = inf{|Mo e = 0, 

and TiAai — 2 > di, then the B ayes factor BJ2 ^'^■s same asymptotic behaviour 
as Vn '■'^^ '^^^ irrespective of the true model. Therefore, it always asymptotically 
selects the model having the smallest effective dimension dj. 

// the true distribution Gn belongs to model 97li and if /ig cannot be repre- 
sented in model 9Jt2, 

- inf{|Mo - 01 e Oi} < inf{|/io - ^^2(^2)1; e 62} , 

and i/min(a2,T2) > di, then the Hayes factor is consistent. 

An essential practical consequence of Theorem [T] is that the Bayes factor 
is merely driven by the means jJiiiOi) and the relative position of /ip in both 
sets {fii{9i);9i S 6^}, i = 1,2. If Gn is in neither model but ^0 belongs to 
{fii{6i),9i e 9i} and not to {^2(^2), ^2 & ©2}, then the Bayes factor will 
asymptotically favour SlJli. Note that the result does not cover the behaviour of 
the Bayes factor when neither model is compatible with T", since there is no 
simple characterisation in this case. 

The following heuristic argument sheds some light on why the above results 
hold. 

Suppose the summary statistics (appropriately rescaled) are asymptotically 
normal under each models and assume that the KuUback-Leibler divergence 
between the distributions y/n{T"' — ^i{9i)) can be approximated by the KuUback- 
Leibler divergence between respective asymptotic Gaussian distributions: 

and 
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where Vq, Vi{9i), and ^2(^2) denote the asymptotic variances under the various 
models, where \V\ denotes the determinant of the matrix V, and where qg is the 
pdf of the standard Gaussian distribution. Then 

-KL{gn{T ),gr[T \0^)\ ^ +0 I , (9 

n 2 

in that case a usual Laplace argument would imply that 



<?n(T") 



So that the difference between /ig and fJ.i{Oi) is the key measure to evaluate the 
distance between gn and gi,n{:\0i)- The above argument is purely illustrative 
since requiring ^ is very strong and not realistic in most cases. 

Interestingly, the best statistics T" to be used in an ABC approximation 
to a Bayes factor are ancillary statistics with different expectations under both 
models. Indeed, if T" depends asymptotically on some of the parameters of one 
of the models, say model DJli, then it is quite likely that there exists 62 S 62 
such that /i2(^2) = Mo even though model 9JI2 is misspecified, in particular if 
d, the dimension of T", is at most the dimension of 92- To expand on this 
remark, consider the case where d = 1 and {fii{9i),9i G 9i} = E (or a large 
enough interval). In this case, T" is not a satisfactory statistic for discriminating 
between models 9Jli and DJI2, when 9JI2 is the wrong model. Consider Example 
[1] of the Laplace versus Gaussian distributions with 

n 

i=l 

and assume that the true distribution is the Laplace with mean 1, so that fj,o = 
13. Since under the Gaussian model /i2(^) — 9* + 3 + 69^, the value 9* = 
\J \/T9 — ~3 leads to [Iq = /i2 (9* ) and a Bayes factor associated to such a statistic 
is not consistent (here d^ = ^2 = !)■ Figure |3] illustrates this lack of consistency 
on 100 samples of size 1000 simulated from both Gaussian and Laplace models 
with mean 1. In that experiment, the distributions of the posterior probabilities 
under both models are quite similar and hence non-discriminant. Interestingly, if 
the true mean is zero instead, the Bayes factor becomes consistent, as discussed 
below in Section [XT] 

However if T" is ancillary (asymptotically), {/ii(0i),0i e 81} is a singleton 
and it is sufficient that this singleton is different from /ig. These remarks are 
illustrated in Section [3l 

In the special case of 9Jti being a submodel of 9712, and if the true distribution 
belongs to the smaller model 9Jti, any summary statistic satisfies 

Mo € {mi(^i);^?i e ej c {ii2{92)\92 e 62}, 
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n^lOOO 



Fig. 3. Comparison of the distributions of ttie posterior probabilities that the data is from 
a Gaussian model (as opposed to a Laplace model) with unknown mean parameter 
when the data is made of n = 1000 observations either from a Gaussian or Laplace 
distribution with mean equal to one and when the summary statistic in the ABC al- 
gorithm is restricted to the empirical fourth moment. The ABC algorithm uses 10* 
proposals (5, 000 for each model) from the prior J\f(Q, 4) and selects the tolerance as the 
1% distance quantile. 



so that the Bayes factor is of order '■'^^ If the summary statistic is in- 

formative merely on a parameter which is the same under both models, i.e., if 
di — d2, then the Bayes factor is not consistent. Else, di < d2 and the Bayes 
factor is consistent under DJti . If the true distribution does not belong to OJli , 
then the same phenomenon as described above occurs and the Bayes factor is 
consistent only if /Xi 7^ = /^o- This case will be illustrated for a quantile 



distribution in Section 3.2 We now discuss the assumptions [Al] [A4] 



2.3. Further comments on the assumptions [Al] - [A4] 

Condition |[A1]| is quite natural. It is often the case that summary statistics 
T" are chosen as empirical versions of quantities of interest (under second order 
moment conditions) and it is natural to assume that they concentrate since they 
are chosen to be both low dimensional and informative on some aspects of the 
model (even though the result also applies to ancillary statistics). For instance, 
when the summary statistics are empirical means or empirical quantiles, condi- 

^/n with the Gaussian distribution being the 



[Al] is satisfied with v„ 



tion 

limiting Q (a most common occurrence). However, if T" is a distance (e.g., 
of the type induced by chi-square like statistics) then Q will be the chi-square 
distribution. 

This convergence is only required under the true distribution Gn. Still, con- 
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dition [A2] implicitly requires that under each model T" concentrates around 
the model mean values fj-i{9i) at the rate Vn, even though i t is no t necessary 
to have convergence in distribution. More precisely, condition |[A2] controls the 
large deviations of the estimator T" from the estimand ^{6) under each model. 
For instance, when T" is an empirical mean, i.e., T" = X]"=i ^iUi) ^'^^ ^ 
given function h, Markov's inequality implies that for every Oi G Qi, 



E[|Er=i{%o-M.(^.)}je. 



< K{9i)u 



(10) 



for large values of p and under very weak assumptions (much weaker than 
being in an i.i.d. setting). The main difficulty in this condition comes from 
the fact that, for our arguments to go through, the factor K{9i) in (10) must 
be controlled uniformly in 6. This is obviously much easier if the parameter 
space is compact. Otherwise, this control can still be achieved by choosing a 
power Q!j that is smaller than p in the following way: consider ^^'s such that 
— Mol < for some positive e, assuming that /iq € {^J'iiOi)',0i G ©i} and 
u = \/n\fii{9i) — /io| > 1 (otherwise we bound the above probability by 1), then 



( 10 1 implies that 



provided K{6i)\fJ,o — Mil^j)! — "-^^ "''^^ on J^n,i- Furthermore, if 6^ is not 

compact, we usually have 

sup K{9i) — oo . 

In such situations we use the sieves Tn,i (which typically are compact subsets of 
Qi) to recover uniform bounds on the constant K{9i) in ( 10 ). On the complement 



of the sieves J-n,i, we need the additional assumption that the tail probability of 
the prior distribution decays sufficiently quickly for large n. This argument is 
illustrated in the Gaussian versus Laplace example, which is detailed in Section 

EH 



Condition |[A3]| is a condition on the prior distribution under either model, as 
often encountered in asymptotic analyses of the posterior distribution, see for 
instance condition (2.5) of Theorem 1 in Ghosal and van der Vaart (2007). 
Usually referred to as the prior mass condition, it corresponds to the fact that 
if the prior vanishes in regions where the likelihood is not too small (i.e., near 
fiQ in our case) then the marginal becomes very small. The exponents di can be 
viewed as effective dimensions of the parameter under the posterior distributions, 
as discussed after Corollary [l] If the maps 9i i— >■ Hi{9i) are locally invertible near 
/iOj under the usual continuity conditions on the maps 9i H> |/io ~ for 
any u > 0, there exists a finite collection of points 6'*- e such that the sets 
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Sn,iiu) can bounded both from above and below by sets of the form 
J 

[Ji^r-\(^j~(^h\^^^n'}, ^eN. (11 
Thus if the prior density tt^ is bounded from above and below near the points 9, 



we immediately deduce that TTi{Sn,i{u)} ~ u'^v~'^ and di = d, verifying [A3] In 
most cases we will have di < d, since assuming that di > d would imply that the 
prior density of ^{9) explodes at fiQ- 

Condition |[A4] [ states that, if there are 0i's such that fj,i{9i) — then uniformly 
in 9 close to one of those 6'i's, gi{t\9i) is bounded from above by gn{t) on a set 
having large probability in terms of G„. There are various instances under 
which this assumption can be satisfied. In particular, if ?;„(T" — jii) converges 
in distribution to Q and if the densities are close, then condition |[A4]| is satisfied. 
It requires in particular that T" has the same support under G„ and Gi, but 
does not necessarily require G; or G„ to be continuous distributions. Condition 
[A4]| may become difficult to check when the sets Sn,i{u) are not compact, 



which is typically the case when the sets {9i; iii{9i) = /Uq} are not compact. The 
important point to note here is that, in such cases, the posterior distribution 
7ri(-|r") is informative not on the whole parameter 9i (at least no more than 
the prior) but only on a fraction of it, summarised by ^i{9i). In such a case, 
for condition |[A4]| to be nonetheless verified, it is important to impose tail 
conditions on the prior so that the sieves J-n,i are not too large or to ensure 
that the distributions Gt^n of T" do not depend on 9i, another occurrence of the 
ancillarity aspect. 



3. Illustrations 

3. 1. Gaussian versus Laplace distributions 

In the setting of Example [l] we denote by DJli the Gaussian model and by 
2Jt2 the Laplace model. In each model, the prior on the mean 9i is a centred 
Gaussian distribution with variance 4 and in each case the data are simulated 
under 9i = 0. For illustrating our main result on consistency, we now consider 
the more manageable summary statistics made of 

• the empirical fourth moment : T" — n^^ ^^=i vt ■ that case, 

^JLl{9)=9^ + 2, + &9'' , fi2i9) = 9* + 6 + 69\ 

• the sixth moment : T" = n^^ J27=i Vi- that case 

Ml (6*) = 61^ + 15 + 456*2 + 159^ , ^2(6') = 9^ + 90+ 159^ + 909^ . 

• both sixth and fourth empirical moments: T" = X]r=i(2^i'' )' then 
the means are the same as before and T" converges in distribution under 
each model to a Gaussian random vector. 
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We now endeavour to check that assumptions | [ Al] 1 1 [A4] hold for those statis- 
tics. Given that they are empirical moments, condition |[A1] is trivially satisfied 
as a consequence of the Central Limit theorem, with ~ ^fn. 

For assumption I [A2]| is already defined above. For b oth models, we 

set = Tn.2 = < uy/\og 71 } = ^01 Condition [A2] to hold, where 

u > so that 

under a Gaussian prior on 9 under both models, which implies = The 
second part of condition [A2] is verified using Markov inequalities. First, for 
M > large enough, there exists cm such that Vi{9) < M implies that \0\ < cm- 
For instance, in the case of the fourth empirical moment, if \d\ < cm 



uniformly. On the other hand, if \9\ > cm, then there exists > such that 
\^J.ii0) - fJ.o\ > Si and 



G, 



smce, m J"„, W.^liY'^ - ^l,[e)Y\e] < Ci{\ogn)^. Thus, assumption [[A2]| is satisfied 
for every ai < A. 

Addressing | [A3] [ in model IJJli, in the case of the fourth empirical moment, 
if the mean is equal to zero, /zq = 3 (resp. 15 and (3, 15) for the other summary 
statistics) and, in model 9Jl2 if O2 — 0, /iq — 6 (resp. 90 and (6,90)). Then 
Sn,i{C) and Sn,2{C) can be bounded from above and below by balls of the form 



16*1 < cC^/^r 



^1/4 



so that c?i = = 1/2 in those cases. Otherwise, if the mean is different from 
zero, ^0 > 3 (resp. > 15) in model fHi and /iq > 6 (resp. > 90) in model 9JI2. 
Then Sn,i{C) and 5„.2(C) can be bounded from above and below by balls in the 
form 

\0'^ -Ol\<cGn~^''^, \e,\>Q 

so that di = (^2 = 1 in those cases. For the bi-dimensional summary statistic, 
provided the mean is different from zero, we have Sn,iiG) 7^ for n large enough 
only if 9Hi is compatible with T". 

If the true distribution belongs to model 9JI2 and the mean is equal to zero, 
/-io G E M.} for both i = 1,2 and we have di = 1 and d2 = 1/2. On the 
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other hand, if the true distribution belongs to model 9Jli (i.e., is Gaussian) then 
di = 1/2 and 

M{\fi„-ti2{02)\;02eR}>0. 

Following from Theorem [l] the Bayes factor is then consistent in both cases but 
at the rate n~^/'^ under model DJl2- This is to some extent an accidental result, 
merely due to the fact that, in that very special case when the mean equal to 
zero, di > d2- Once again, if the mean is different from zero, then a similar 
argument leads to the lack of consistency of the Bayes factor under model 9JI2, 
since then di = d2 = ^ and /iq € & € K}, for both i = 1,2. 

Since Y allows for any moment under both distributions, and since both 
distributions satisfy Cramer condition, T" allows for an Edgeworth expansion 
under both models, which can be made un iform in sets in the form {\0i\ < 



Cn~^/^}, see Bhattacharya and Rao (19861, Theorem 19.1. Hence condition 
|[A4]| is satisfied. 

Figures |4] and [5] illustrate the above discussion in the case when the mean 
is equal to zero. When using solely the empirical fourth moment, the posterior 
probability for the normal model is highly concentrated near 1 when the obser- 
vations are normally distributed, while the posterior probability for the normal 
model slowly decreases to zero with the number of observations when they are 
Laplace distributed. When using both the fourth and the sixth moments, the 
convergence (to zero) in the Laplace case occurs faster. We note that the dis- 
tance used for the latter case is an Euclidean distance with weights 1 and 1 / 100 
on the fourth and the sixth components, in order to compensate for the one- 
hundred-fold larger values of the square differences of the sixth moments. Using 
a regular Euclidean distance led to account only for the empirical sixth moment 
statistic, obliterating the impact of the fourth moment and hence the consistency 
property. 



3.2. Quantile distributions 

We now consider the example of a four-parameter quantile distribution, defined 
through its quantile function 



Qip;A,B,g,k)^A + B 



where z{p) is the pth standard normal quantile and the parameters A, B, g and 



k represent location, scale, skewness and kurtosis, respectively (Haynes et al 
1997). While the quantile function is well-defined, and the distribution easy 
to simulate, there is no closed-form expression for the corresponding density 
function, which makes the implementation of an MCMC algorithm quite delicate. 



Allingham et al. ( 2009 1 introduce a ABC procedure that uses the order statistics 



as summary statistics. We consider here a model choice perspective. 

In this experiment, we set A = and B — 1. We then oppose two models: 
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n=100 



n=1000 



n=10000 




Fig. 4. Comparison of the distributions of the posterior probabilities that the data is from 
a Gaussian model (as opposed to a Laplace model) with unknown mean parameter 
when the data is made of n = 100, 1000, 10000 observations generated from either from 
a Gaussian or Laplace distribution with mean equal to zero and when the summary 
statistic in the ABC algorithm is restricted to tlie empirical fourth moment. The ABC 
algorithm uses 10'' proposals (5, 000 for each model) from the prior A/'(0, 4) and selects 
the tolerance as the 1% distance quantile. 



• model 1 (9?ti), in which g = Q, with a single unknown parameter 9i = k 
and a prior 9i ^ 1/2,5]. In the simulation process, when 9Jti is true, 
we choose 6i ^ k = 2. 

• model 2 (9H2), with two unknown parameters 62 = (5, fc) and a prior 
O2 — {g,k) ~ W[0,4] (g) W[— 1/2, 5]. In the simulation process, when DJ12 is 
true, we choose 6*2. 1 = g = I and 6*2.2 = k = 2. 

This obviously is a case of embedded models, since Tti is a sub-model of DJl2- As 
in the previous experiments, we use an ABC procedure relying on 10* proposals 
from the prior and a tolerance set at the 1% quantile of the Li distances between 
some empirical quantiles. In the comparison below, we first use the empirical 
quantile of order 10% as sole summary statistic. Then, we consider the empirical 
quantiles of order 10% and 90%, and, at last, the empirical quantiles of order 
10%, 40%, 60% and 90%. The results are summarised in Figure |6] They show 
complete agreement with Theorem 1. 

When the summary statistic T" is restricted to the empirical quantile of 
order 10%, the Bayes factor is not consistent. Indeed, in such a case, we have 

/.i(0i) = (l + z(O.l)2)'^z(O.l), 

and 

When 9Jli is true with k — 2, then ^0 ~ —8.95 and inf{|^o — ^1(^*1)1; ^'i G Qi} — 
inf{|/io — l^2{02)\](^2 G ©2} = 0. Similarly, when 9Jt2 is true with g = 1 and 
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n=100 



n=1000 



n=10000 




Fig. 5. Same legend as in Fig. |4]when the ABC algorithm is based on both the fourth 
and sixth empirical moments as summary statistics. 



0. 



k — 2, then fiQ w —4.90 and 

inf{|Mo - Mi(^i)l;^i G 0i} = inf{|Mo - ^^2(^2)!; ^2 G Oa} 

Therefore, the Bayes factor has the same asymptotic behaviour as n^id-i-d2)/2 
irrespective of the true model We can prove that di = d2 = I in this case, 
therefore that the Bayes factor is not consistent. 

When the summary statistics T" is the vector made of the empirical quantiles 
of order 10% and 90%, the Bayes factor is consistent. Indeed, in such a case, we 
have 



and 



^2(^2) = 



) = (1 + zi0.1)Y" 2(0.1), (1 + ^(0.9)2)"^ z(0.9) 
1 - cxp(-6'2,i2(0.1)) 



1 + 0.i 



(l + z(0.1)2)"-z(0.1). 



1 + 0.8: 



(l + z(0.9)2)'^'^^(0.9) 



l+cxp(-02,i-2(O-l)), 

, l-cxp(-g2,i2(0.9)) ^ 
'l + exp(-6i2,i2(0.9))^ 
When Ml is true with k ^ 2, then fiQ ~ [-8.95, 8.95] and 

inf{|/io -Mi(^i)l;^i e ei} - inf{|Mo - ^^2(^2)1:^2 e 62} = 0. 

We can prove that di < d2 and hence that the Bayes factor is consistent. More- 
over, when C0T2 is true with g = 1 and k = 2, then /io ~ [—4.90, 12.99], and 

inf{|Mo-M2(e2)l;e2 ee2}-o but m^{\^io-^l2i9l)\■,Oleel}>o. 

We can prove that min(Q!i,Ti) > d2 and therefore that the Bayes factor is again 
consistent. 

FinaUy, when the summary statistics T" is the larger vector made of the em- 
pirical quantiles of order 10%, 40%, 60% and 90%, the Bayes factor is obviously 
consistent. However, the results obtained in Figure |6] are very similar to the ones 
obtained with only two empirical quantiles. 
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Fig. 6. Comparison of the distributions of the posterior probabilities that the data is from 
model Ml when the data is made of 100 observations (left column), 1000 observations 
(central column) and 10,000 observations ('r/gf/7fco/umn^ either from OTi (M1) or M2 (M2) 
when the summary statistics in the ABC algorithm are made of the empirical quantile at 
level 10% (first row), the empirical quantiles at levels 10% and 90% (second row), and 
the empirical quantiles at levels 10%, 40%, 60% and 90% (third row), respectively. The 
boxplots rely on 100 replicas and the ABC algorithms are based on lO" proposals (5, 000 
for each model) from the prior, with the tolerance being chosen as the 1% quantile on 
the distances. 



4. Checking for appropriate statistics 



4.1. A practical procedure 

While the above result operates in an asymptotic and theoretical framework, it 
is possible to derive a methodological implementation from this characterisation 
of consistent summary statistics. Indeed, once an ABC scheme is constructed for 
each of the two models under comparison, estimates of the model parameters, 
61 and 02, can be derived from the ABC reference tables produced for each 
model, using the available summary statistics or more advanced schemes as in 



Fearnhead and Prangle (2012). Given those ABC estimates of the parameters 



under both models, both expectations of a given summary statistic T{x) can be 
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approximated, namely 

First, this implementation is feasible since the prerequisite to running ABC 
under both models is that the simulation of a pseudo-sample is manageable. 
Second, the core requirement in our theoretical derivation is that the range of 
Eg,[T(x)] differs between models. Obviously, in the special case when T is an 
ancillary statistic under both models, checking for a different range only requires 
a single simulation run under each model. Otherwise, evaluating the means for 
the whole range of fl^'s under both models is too costly. However, given that 
there is a true mean hq = E°[r(x)] (where the superscript means the expectation 
is computed under the true distribution) , this true mean is asymptotically equal 
to E^^ [T(x)] if model 97li is compatible with T" (or, better, if model Tlx is 
the true model). Here as well, superscript means the expectation is computed 
under the corresponding model. Since this approximation is converging with 
the sample size, it is reasonable to restrict the evaluation of the expectation 
at this best possible choice, 9i. If model 9Jl2 is not the true model, there still 
exists a pseudo-true value that corresponds to the distribution within model 97l2 
that is closest to the true distribution of T(x). Once again, 02 is a convergent 
approximation of this pseudo-true value. Therefore, E| [T'(x)] is asymptotically 
the best approximation to the true mean E°[T(x)]. Rather than checking for 
the whole range of Eg^ [T'(x)] when 62 varies, we can therefore check whether or 
not El [T(x)] is equal to E? [r(x)], via samples obtained by simulation. The 
computation tool used for the implementation is therefore a simple t test on the 
means based on two samples of T(x)'s, one simulated under each model. The 
computational cost is therefore negligible compared with producing the ABC 
reference table. 

In the case of the normal versus Laplace toy problem, we ran a hundred 
ABC evaluations based on three and four statistics, empirical mean, empirical 
median, and empirical variance, without and with the empirical mad, using the 
same reference table with 10^ simulations in all cases. The two choices of the 
summary statistic vector led to two different ABC estimators under each model, 
9i and 62 respectively. For each of those estimators, we repeatedly simulated 
samples from /| and /| , respectively, with the same size as the original sample, 
producing a sample of T(x)'s, and we deduced from this sample a t test about the 
equality of the means. Figure [7| evaluates the impact of including the empirical 
mad within those summary statistics on the result of the t test. The result 
of this simulation experiment (based on 100 replications) is quite satisfactory 
in that in all cases the difference between the two empirical means falls within 



1=1 

^i[^w]-^E^(^2,), 

1=1 
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the null hypothesis acceptance interval when the mad is not included, therefore 
concluding on the inappropriateness of the summary statistics to conduct the 
ABC model comparison, and outside the null hypothesis acceptance interval 
when it is included. 



Fig. 7. Boxplot representation of the variability of the i-statistics comparing the mean 
of summary statistics under the normal and Laplace location models (lighter boxes for 
normal data and darker boxes for Laplace data). For each of the 100 points used for the 
boxplots, the data is made of n = 150 observations and the t-statistics are based on 
10* samples of the same size simulated using a location parameter estimated by ABC 
based on a summary statistic without (left) or with the mad statistic (right). The ABC 
algorithm uses a fixed reference table of 10^ proposals 50, 000 simulated from the prior 
6 ~ A/'(0, 4) and the respective model, and it selects the tolerance e as the 1% quantile 
of those simulation distances. 



4.2. Population genetics experiment 

We now consider a Monte Carlo experiment that more directly relates to the 
genesis of ABC, namely population genetics. As in [Robert et al. (20111, we 



consider two populations (1 and 2) having diverged at a fixed time t' in the past 
and a third population (3) having diverged from one of those two populations 
(models 1 and 2, respectively). Times are set to t' = 60 generations for the first 
divergence and t = SO generations for the second one. The effective population 
size is assumed to be identical for all three populations and equal to iVe = 
60. Recall that the effective size of a population is defined as the size of an 
ideal (Wright-Fisher) population that would show the same behaviour as the 
population of interest, in terms of loss of genetic variation due to random drift. 
We assume we observed 50 diploid individuals per population genotyped at 5, 
50 or 100 independent microsatellite loci, this number acting as a proxy to the 
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sample size. These loci are assumed to evolve according to the stepwise mutation 
model: when a mutation occurs, the number of repetitions of the mutated gene 
increases or decreases by one unit with equal probability. For each configuration 
(defined in terms of loci numbers), we generate 100 observations for which the 
mutation rate /i is common to all loci and set to 0.005. In these experiments, 
both scenarios have a single parameter, the mutation rate /x. We chose a uniform 
prior distribution Z//[0.0001, 0.01] on this parameter fi. 

For the ABC analysis, we use three summary statistics associated to the ((5^)^ 

Let xi^ij 



distances (Goldstein et al. 1995 Cooper et al. 19991. Let xi^ij be the repeated 
number of allele in locus 1 = 1,.. . , L {L = 5, 50, 100) for individual i = 1, . . . , 100 
(corresponding to 50 diploid individuals) within population j = 1,2,3. The ((5/i)^ 
distance between population ji and j2, denoted by (<5^)^ , is: 



100 



L ^ I 100 ^ 

1=1 \ ii=i 



a;/, J 



1 ji 



1 

Too 



100 

i2 = l 



Let us consider two copies of the locus I with allele sizes xi^i^-^ and xi^i'^j^, 
and assume that the most recent time in the past for which they have a common 
ancestor, defined as the coalescence time Tj^ i is known. The two copies are then 
separated by a branch of gene genealogy of total length ^tj-^j^. As explained in 
Slatkin ( 1995 1, according to the coalescent process, during that time the number 



of mutations is a random variable distributed from a Poisson distribution with 
parameter 2^Tj^j^. Therefore, if the stepwise mutation model is considered, we 
get (under models 1 and 2) 

In addition, if ji = 1 and j2 = 2, we have (under models 1 and 2) 



and 



Moreover, 



E 



E{ri,2} = (2iVe+i') , 



100 ^ 100 

^M,i-Y^E^M.,i + i^E^'.-. 

ii = l ii=l 



^ 100 ^ 100 



100 



82 = 1 



i2 = l 
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Table 1. Theoretical expectations 
of tlie {Sfi)l^i statistics under both 
models 

Model 1 Model 2 
E {(5/1)2,3} 2/t2i 




E 



1 ^ y 

^2 = 1 / 

Ton E -^^'-1 - Too E ^ = 4^eM + 2/.t' . 



100 ^ ■ 100 

41=1 12 = 1 




The coalescent process associated to the stepwise mutation model gives 
and then 

^{{5^i)l^]=2^lt' . 

We can apply the same type of reasoning to the other distances, and if fii 

denotes the mutation rate under model i, we get the following table: 

Therefore, whatever model the data originates from (whether it is model 1 
or model 2), the Bayes factor based only on the distance (i5/J.)i 2 as summary 
statistic does not converge. Indeed, if = /Lt2, we get the same expectation 
on the first line of Table [l] The same occurs when only (i5/i)f 3 (resp. {5^x)\^^) 
is used. Indeed, in that case, if /ii = 2^2 (resp. = ^2) we get the same 
expectation on the second (resp., the third) row of Table [ij Now, if either two 
or three of the distances are used, the Bayes factors converge. Indeed, in these 
settings, no value of fj,i and ^2 can produce equal expectations. 

Figure |8] shows how the empirical results confirm this theoretical analysis. 
Even the medium case of 50 loci indicate whether the use of the corresponding 
summary statistic(s) is valid or not. Under both models, the ABC computations 



have been performed using the DIY-ABC software (Cornuet et al. 2008). 



5. Discussion 

This paper has produced sufficient conditions for a summary statistic to produce 
a consistent or an inconsistent Bayesian model choice. It thus brings a clear an- 



swer to the question raised in Robert et al. (20111, who was alerting the ABC 
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Fig. 8. Comparison of the distributions of the posterior probabilities that the data is from 
model 1 for 5, 50 and 100 loci (left, centre, right, resp.) when the summary statistic in 
the ABC algorithm is made of some different (Sfi)^ distances. The ABC algorithm uses 
2 X 10^ proposals (10"' for each model) from the prior and selects the tolerance e as the 
0.5% distance quantile 



community on the potential pitfalls of an uncontrolled use of ABC approxima- 
tions to Bayes factors. The central condition that the true asymptotic mean of 
the summary statistic should not be recovered under the wrong model if model 
choice is to take place (in a convergent manner) is both natural, in that the 
asymptotic normality implies that only first moments matter, and fundamental, 
in that it drives the choice of summary statistics in practical settings, first and 
foremost for the ABC algorithm. Indeed, Theorem [T] implies that estimation 
statistics should not be used in ABC algorithms aiming at model comparison, 
unless their expectation can be shown to differ under both models. This means 
that (a) different statistics should be used for estimation and for testing and (b) 
that they should not be mixed in a single summary statistic. Note that the dis- 
tinction differs from the sufficient versus ancillary opposition found in classical 
statistics (Cox and Hinkley 1994) in that it is enough that the summary statis- 
tic r„ has a different asymptotic mean under both models. In addition, and as 
shown in the normal-Laplace example in Section |3.1[ some ancillary statistics 
may not be appropriate for testing. 

At a methodological level, the classification of summary statistics resulting 
from the present study is paramount: when comparing models with a given 
range of potential summary statistics, the expectations of the various summary 
statistics can be evaluated by simulation under all models. For instance, in ABC 
settings, the production of pseudo-data is a requirement for the implementation 
of the method; it is therefore quite straightforward to test via a preliminary 
experiment whether the condition of Theorem [l] holds. 
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Nor the final choice of summary statistics as in Fearnhead and Prangle ( 2012 ), 
neither the comparison with alternative model comparisons techniques such as 
ABC^ ( [Ratmann et al. 2009) are covered in the current paper. These are most 
obviously issues worth investigating and they potentially constitute gems for 
future development in the area. 



Acknowledgements 

Part of this work was done when the second author (NSP) was visiting Dauphine 
and CREST and he thanks these institutions for their warm hospitality. All 
authors but the second author are partially supported by the Agence Nationale 
de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2009-2012 
projects Bandhits and Emile. The second author is partially supported by the 
NSF grant 1107070. The authors are grateful to the whole editorial panel for 
their supportive and constructive comments. 



Appendix 1 

Proof of LemmaUl 

Recall that G„ is the true distribution of T". Let us first assume that inf{|/io — 
fJ'iidi)]', 9i G Oi} = and let S„,i be as defined in assumption [A3] Fix constants 
e,U > and let AI^ be such that 

q({X eR'' -.IX] > M,}) = e. 



Note that Ms goes to infinity as S goes to 0. Consider the set £"„ defined in Assumption 
[A4][ From ^ we have for all T" £ E„, 



m,{T")> / g,{T"\9,)n,{e,)de, 



where the first inequality follows from the definition of En, the second from assumption 



[A3] and the last from assumption [Al] Finally, since G„{Enr\{vn\{t — fio)\ > Ms}) > 
1 — e — 2e = 1 — 3e, there exists Ce > such that for n large enough 

G„ [m.(T") > c, p„(T"X''*] > 1 - 3e. (12) 

We now obtain an upper bound for mi{T"). Using ^ we write. 

As before fix 5 > and let Ms be a constant such that 

G„(1T" - Atol > Msv-^) < 35/2, 
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for n large enough. Applying Markov's inequality and Fubini's theorem, we obtain 
that, for all e > 0, 

G„( J g,{T"\6^)7T,{e,)d0, > e(7„(T")7r,(S'„,0) 

^ -gn^{t)g^{t\e,)dt-K,{e,)dei 

(13) 



< G„[|T" -Mol > A/a<'] 
1 



g,{t\e,)dtTT,{ei)dei 



< 35/2 + ^^^4 < 25, 



given the conditions imposed by [A2] and [A3] when n is large enough. 
We can represent J^n^i as a finite disjoint union of the following sets: 



Now we have 



J^n,i = U ' ^ JoVn, for some Jo G N , 

= S'n,,((j + l)M4)nS'„,,(jM^)", j<J„, 

Uj„+i = j;,, nS'^,,(M4j„) . 



/ g^{T'^\e,)n,{e,)de,= Y. g^{T^^\e^)■K^{e,)de, 



(14) 



Set Co = J2T=2j~'-°''~'^'''^^ < +°° since a, > 2 + d,. Then, if j = 0, Ho = S„,,{M5) 
and if is a constant such that K > di we obtain 

G„ / g,{T''\e,)Tv,{6,)dei> Alfg4T")v-''^ 



ni{Sn,i{Ms)) + S 



0{Mt^''') + 5, (15) 



where the last inequality follows from Q in [A3] Since limsup^^o = oo, the bound 
in ( |15| l goes to as 5 goes to zero. Using assumption | [A2]"| and following exactly the 
same argumentation for (131, i.e. Markov's inequality and Fubini's theorem,we obtain 
that for < j < Jn, 



cov 



g,{T"\9,)n,{0,)d9, > Mi" g„iT")v-/' j^"'-"^^'^} n [vu\T" - ^io\ < 
/ G,,„ (|T" - fi{e,)\ > {j - l/2)Afi<>.) Me^)d9i 

(16) 
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for n large enough, and similarly 



( / g,{T"\e,)MO^)de,>gn{T")v;:''^' 



(17) 



+ G„ KlT"-^o| >Mi) 
< 35/2 + «^*""" < 25, 



for n large enough, under assumption [A3] Combining the above inequalities (131, 
(16 1, and (17 1 with (141, we obtain for n large enough, 

G„( J g,{T"\e,)-K,{e^)de, > (2Mf + l)g„(T")«-''') 

< G„ (^v„\T'' - Mol > ^Ms^ + M^^-" 

which can be made arbitrarily small by choosing 5 small enough. Combining the above 
with ( 13 1 implies that 



I 



yny-i ) 



The above estimate together with the lower bound obtained in ( 12 1 proves the first 
claim (Equation ^) of Lemma [l] 

Now suppose mi{\ij,i{8i) — /io|;6'i G 6^} > 0. Then there exists jo > such that 
Sn,i{j Vn) = for all j < jo. An identical computation as in (17 1, together with (13 1, 
yields 



G, 



< GnMT" - flo\> Ms) 



+- 



< 35, 



G,,„ (IT" - ^,(6(,)1 > io^'n/2) n,{e,) de, + 25 



for all n large enough and e > 0. This proves the second claim (Equation (|8|) of 
Lemma [T] □ 
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